1 Introduction

With Google evaluating sites on a wide range of ranking factors, knowing which factors to focus your SEO strategy on for the biggest impact is crucial.

Several large-scale data studies, mainly conducted by SEO vendors, have sought to uncover the relevance and importance of individual ranking factors. However, in our view, these studies contain major statistical flaws. For example, using correlation statistics as the main instrument can produce misleading results in the presence of outliers or non-linear associations.

Given these methodological issues and the omission of certain ranking factors, there is a need for rock-solid data distilled into clear takeaways.

1.1 Methodology

  • Step 1 Ahrefs Raw Data: As a data partner, Ahrefs provided the raw data for the analysis. The data contained 1,183,680 keywords (1,183,628 after data cleaning; for details see below) with a total of 11,835,086 ranking URLs, of which 10,052,136 are unique URLs (10,052,028 after data cleaning).

  • Step 2 Data Mining: We developed a data-mining script to gather data on various variables. More specifically, we collected data on Schema.org Usage, Word Count, Title, H1, Broken Links and Page Size (HTML). Due to anti-mining mechanisms, authoritative domains such as amazon.com or youtube.com were not considered (see Section 1.2 for the number of observations we excluded for each domain). In the forthcoming sections, we refer to these as “Large Domains”. No data could be extracted for roughly 6% of the URLs due to server response errors. In total, we mined data from about 7,633,169 URLs.

  • Step 3 APIs and external data sources: In addition, the Alexa API was used to collect domain-level data for the Time-on-Site and Page Speed variables. Furthermore, Clearscope.io, another data partner, provided “content scores” for 1,000 high-search-volume keywords (see Section 2.3.3 for a detailed explanation of that metric).

  • Step 4 Data analysis: The data was analysed and processed for selected features to show whether they have a positive or negative trend across Google ranking positions. Polynomial regression was applied to all numeric variables. In some cases, linear regression was used (e.g. URL length) to provide simple average trends.

A note on chart types:

We use three chart types to represent the data and the trends across positions that may be considered “non-traditional”. Here are some notes on how to read them and why we think they are helpful.

  1. Multiple probability intervals (“distribution stripes”):
    • The plot shows in a simple way how the data is distributed and makes it easy to compare distributions within and among positions.
    • For each position, several bars of different colours are drawn, each containing X% of the values, starting at the median (somewhere inside the 5% area).
    • The dark(er) stripes essentially act as a visual fit and allow us to determine whether the metric changes with position or between large domains and others.
    • In some cases, only one or two bars are visible in the plot; this is because the lower percentages of the data fall within a very narrow range of the metric and are thus invisible to the eye.
    • Possible adjustments: We can of course change the number of levels (currently 6, for data exploration and to give you a choice) and their thresholds. If, for example, you think 6 are too many and no one is interested in where 25% or 75% of the data lie, we can create these plots with 4 levels instead: 5%, 50%, 95%, 100%.
  2. Point intervals with polynomial or linear fitting:
    • The plot shows, somewhat similarly to the distribution stripes, where the majority of the data sits.
    • The dot represents the median value, the thick line 50% of the data and the thin line 95% of the data.
    • In some cases, only a dot can be seen; this is because more than 95% of the data are equal or close to the median (so the lines lie behind the dot). Due to some outliers, the fitting might look a bit off (but take a closer look at the axis ranges of median and maximum; often the trend is negligible anyway).
    • Possible adjustments: As with the chart type above, we can change the interval ranges (currently 50% and 95%) to any you like and also reduce or increase their number.
  3. Diverging range plot (we are not sure whether this is an established name):
    • Due to the complexity of the data (domain, metric and change in position), we tried a new plot type that shows the median value of the metric (black dot) and the range from minimum to maximum (segments).
    • The colours indicate whether the values above and below the average, respectively, belong to a lower or higher average position.
    • Once it clicks, this plot shows in a simple way which large domains have considerably higher/lower medians and/or ranges, plus whether an increase in that metric is correlated with an increase in average position (all or most segments right of the dots are Backlinko cyan) or with a decrease (all or most segments right of the dots are purple).
    • (Detailed methods: For each domain, we calculated the mean of the metric. For each URL, we recorded whether it scored lower or higher than that average. We then compared the mean positions of these two groups and coloured the segments accordingly. Example: with a mean of 5, if URLs on low positions mostly score low and those on high positions score high, the mean position of URLs above the metric’s mean will be greater than that of URLs below it.)
    • Possible adjustments: Again, the length of the segments could be adjusted to represent 50%, 95% or 37.6278% of the data.
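The stripe construction above amounts to nested, median-centred quantile intervals. A minimal Python sketch on synthetic data (the function and variable names are our own, not from the analysis code):

```python
import numpy as np

def distribution_stripes(values, levels=(0.05, 0.50, 0.95, 1.00)):
    """For one position, return the (low, high) bounds that contain
    each requested fraction of the data, centred on the median."""
    values = np.asarray(values, dtype=float)
    stripes = {}
    for frac in levels:
        lo = np.quantile(values, 0.5 - frac / 2)
        hi = np.quantile(values, 0.5 + frac / 2)
        stripes[frac] = (lo, hi)
    return stripes

# toy data: a heavily right-skewed metric, as for backlinks in the report
rng = np.random.default_rng(0)
metric = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)
stripes = distribution_stripes(metric)
# the 5% stripe hugs the median; the 100% stripe spans min..max - for
# skewed metrics this is why only one or two bars remain visible
```

Drawing one such set of stripes per position, with darker colours for smaller fractions, reproduces the chart.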

A note on the fittings (visualised in the point-range plots):

  • Compared to simple linear regression, polynomial fittings are a great way to capture more complex patterns in the data. However, they make it harder to put hard numbers on the trends (since the relationship is not linearly scaled, as in “1% more → 1 position more”).
  • Where the polynomial fitting was close to the outcome of a simple linear regression, we used the linear regression instead to reduce complexity and provide simple, linearly scaled lifting numbers.
  • In some cases, the fitting does not have much explanatory power, so we decided not to include models in every case and/or to state this prominently (referring to a low R^2, for example).
  • Please keep in mind that several of the fittings can be misleading and/or are only vaguely supported. Often, the trends are driven by a few URLs with very extreme values compared to the majority (95% or even more of the data). And since correlation does not imply causality, the metric is likely not what drives the pattern; rather, other factors lead to a few URLs with extreme values ranking best (see, for example, backlinks and referring domains).
  • Possible adjustments:
    • In any case, it is possible to exclude such outliers and calculate the linear fitting/lifting numbers for, say, the top 95% of the data of each position.
    • Depending on the time left, another option would be generalized linear (mixed) effect models. With this advanced type of regression, we would likely be able to fit a range of explanatory variables/metrics to see how they affect the response variable “position”. This way, we could directly determine the (relative) effect on the response variable and dig a bit deeper than investigating the effect of each variable on its own. Possible drawbacks here could be (i) the sheer amount of data, which may cause problems when fitting the model; (ii) correlation between explanatory variables, which leads to the exclusion of some variables (otherwise, effects would be “masked”) - examples would be backlinks and referring domains, exact and partial anchor matches, and likely more; (iii) potential problems with the model’s prerequisites, which could lead to an iteration of model runs and adjustments to find the best data transformation for each variable.
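The fall-back from polynomial to linear fitting described in the note can be sketched as follows. `numpy.polyfit` stands in for the report’s actual modelling code, the data is synthetic, and the 0.05 threshold is an illustrative choice of ours:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination for a fitted curve."""
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - ss_res / ss_tot

positions = np.arange(1, 11, dtype=float)
# synthetic per-position medians: a mild linear trend plus noise
rng = np.random.default_rng(1)
medians = 60.0 + 1.0 * positions + rng.normal(0.0, 0.5, size=10)

lin = np.polyfit(positions, medians, deg=1)
poly = np.polyfit(positions, medians, deg=3)
r2_lin = r_squared(medians, np.polyval(lin, positions))
r2_poly = r_squared(medians, np.polyval(poly, positions))

# if the cubic barely improves on the line, fall back to the linear
# model: its slope is a directly interpretable "lifting" number
use_linear = (r2_poly - r2_lin) < 0.05
lifting_per_position = lin[0]
```

Here the slope plays the role of the report’s lifting numbers, e.g. “roughly one unit of the metric per position gained”.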

1.2 Cleaning the Data: What Information Do We Keep for Analysis?

Ahrefs Data

  • Some keywords have fewer than 10 ranking URLs → We removed 52 keywords that contained fewer than 5 positions.

  • Metrics provided:

    • domain rating (Domain_rating) → 11,834,947 values

    • URL rating (URL_rating) → 11,834,932 values

    • number of backlinks (backlinks) → 11,834,947 values

    • number of referring domains (refdomains) → 11,834,947 values

    • exact match (perc_exact_matches) → 11,834,969 values

    • partial match (perc_partial_matches) → 11,834,969 values

    • URL length → 11,834,969 values
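The keyword filter from the first bullet can be sketched in a few lines of pandas (the column names are illustrative, not Ahrefs’ actual schema):

```python
import pandas as pd

# toy SERP data: each row is one ranking URL for a keyword;
# keyword "b" has fewer than 5 positions and should be dropped
serps = pd.DataFrame({
    "keyword": ["a"] * 10 + ["b"] * 3,
    "position": list(range(1, 11)) + [1, 2, 3],
})

# count ranked positions per keyword and keep only keywords with >= 5
counts = serps.groupby("keyword")["position"].transform("size")
cleaned = serps[counts >= 5]
```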


Additional Info & Plot on NAs

  • Some metrics contain NA values:
    • Domain rating: 22 missing values
    • URL rating: 37 missing values
    • number of backlinks: 22 missing values
    • number of referring domains: 22 missing values


We also have a large number of big domains that we did not scrape, and we compare the trends for both large domains and all other URLs. The following domains were classified as large domains:

Domain Count
en.wikipedia.org 314,512
youtube.com 299,564
amazon.com 295,468
facebook.com 230,974
pinterest.com 144,683
yelp.com 140,694
tripadvisor.com 82,815
ebay.com 76,819
reddit.com 70,614
linkedin.com 69,778
twitter.com 66,438
walmart.com 65,094
imdb.com 63,496
yellowpages.com 47,135
mapquest.com 43,779
quora.com 41,583
etsy.com 40,675
target.com 29,727
instagram.com 29,634


9,681,487 URLs were classified as other domains.
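The split into large and other domains can be sketched as a simple hostname lookup (the URL parsing here is simplified and, for example, ignores `www.` prefixes):

```python
import pandas as pd

# subset of the large-domain table above
LARGE_DOMAINS = {
    "en.wikipedia.org", "youtube.com", "amazon.com", "facebook.com",
    "pinterest.com", "yelp.com", "tripadvisor.com", "ebay.com",
}

urls = pd.DataFrame({"url": [
    "https://en.wikipedia.org/wiki/Search_engine_optimization",
    "https://example.com/blog/post",
]})

# pull the hostname out of each URL and flag large domains
urls["domain"] = urls["url"].str.extract(r"https?://([^/]+)/", expand=False)
urls["is_large"] = urls["domain"].isin(LARGE_DOMAINS)
```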

2 Research Findings

In this section, we analyse how different ranking factors relate to higher organic positions in the Search Engine Results Pages (SERPs).

More specifically, we look at the following factors:

  • Backlink Factors
  • Domain Factors
  • Page-level Factors

2.1.2 Referring Domains

(Note: Logarithmic scale on the x axis.)


(Note: Logarithmic scale on both the x and the y axis.)


Key takeaways:

  • Almost all URLs do not contain any referring domains.

  • This pattern is independent of position (see additional plots below).

  • Due to the highly skewed data, any trend found has to be treated with caution: a few URLs with millions of backlinks drive the pattern.

Additional plots

Key takeaways:

  • The number of referring domains shows the same pattern as backlinks, with more than 95% of URLs having no referring domains at all (only dots in the point interval, only light bars in the distribution stripes).

  • Again, the maximum range does not seem to correlate with position (if anything, more referring domains are found for URLs on higher positions, but we will look at this later in more detail). !!! What do you mean by maximum range? !!CED: See the comment on backlinks above.

  • The trend seems obvious but is not a trend at all: there is a difference of approx. 0.5 referring domains (do you mean referring domains? !!CED: Yes, sorry.) between #1 and lower positions! !!! Backlinks and referring domains are the most important variables, so I would be more specific here. What trend seems obvious? A positive one? Please describe more precisely what you mean. !!CED: Hm, I did write that it is no trend “at all”? The same applies here as for the backlinks we discussed on the phone: unfortunately the data is very skewed, with almost only zeros, and the trends are just artefacts. I would not pick up any of these findings. I can happily rerun the analysis without the outliers; the result will probably be that the value is simply zero and therefore nothing changes with position.

!!! Dan: Even if the trend is small, I would describe it. For many website owners it is hard to get even a few citations from other websites at all. A whole billion-dollar industry has formed around acquiring backlinks (backlinks/referring domains are more or less the same thing: referring domains are essentially unique backlinks). In the end, this is an important finding even if the value is small. It means you only need a few citations from other websites to rank higher in Google.



Effect of Large Domains

Key takeaways:

  • On average, large domains have 0.968 referring domains while all other URLs have 0.079.

  • Large domains reach a maximum of 10,155 referring domains while all other URLs reach a maximum of 7,414. There seems to be no pattern of maximum range with position.



2.1.3 Schema.org Usage

Key takeaways:

  • Most URLs do not use schema markup (72.6%).

  • If there is any trend, it is that #1 and #2 feature URLs with schema markup slightly less often (25.1% and 26.3%, versus between 27.2% and 28.1% for all other positions).



Effect of Large Domains

  • Large domains use schema markup remarkably less often than other domains.

  • The pattern is not linear but rather a (very slight) U-shape: the lowest shares of URLs with schema markup are found at positions #1 and #2 (11.6% and 16.8%) as well as #9 and #10 (16.8% and 14.3%).


  • The use of schema markup by large domains is highly skewed: target.com uses it by far more often than the other large domains, which are all below 28% (70.7% for target.com versus an average of 8.7% for all other large domains).

  • Many domains rarely or never use schema markup.

  • Large domains that do not contain any schema markup at all are en.wikipedia.org, imdb.com, quora.com and tripadvisor.com.


2.2 Domain Factors

2.2.1 Domain Rating

!!! Please do not label the x axis as %. It is simply a numeric variable. Please apply this to the other plots as well. !!CED: adjusted.

Key takeaways:

  • Half of the URLs have a domain rating below 80 and half above 80.



Key takeaways:

  • Average and median domain rating increase with better position.

  • #2 has the highest average and median rating (74.3 and 84).

  • The median is above 80 for #1-#4, exactly 80 for #5 and #6, and below 80 for all lower-ranked URLs (maximum median of 84 for #2, minimum median of 76 for #10).

  • URLs of all positions cover the whole range from 0 to 100.

Effect of Large Domains

Key takeaways:

  • Large domains have remarkably higher average and median domain ratings (mean of 95.4 and median of 95) compared to all other domains (65.2 and 75).

  • The range of ratings is very narrow for large domains while it is much wider for all others.

  • Again, the whole range of possible ratings is covered in all cases.


Key takeaways:

  • Large domains have quite similar average ratings, ranging between 89 and 100.

  • facebook.com has an average domain rating of 100, closely followed by the social media platforms (twitter.com: 99, linkedin.com: 98, youtube.com: 98, instagram.com: 98, pinterest.com: 97), en.wikipedia.org (95) and amazon.com (95).

  • The lowest scores of all large domains belong to yellowpages.com (89), target.com (90) and walmart.com (90).

  • In general, lower ratings correlate with lower positions for most large domains, with the exception of ebay.com and walmart.com.



2.2.2 Page Speed

(Note: Logarithmic scale on the x axis.)





(Note: We excluded the 100% range here to make the pattern better visible. A plot containing the 100% data as well can be found after the key takeaways.)

  • Median page speed is 1.65 seconds, independent of position.

  • The range of speeds also does not differ among positions.

  • Note: Again, the trend is driven by a few outliers with very high page speed values - we would not conclude here that better-ranked URLs are slower. More likely, a few heavy and slow pages that often rank in the top 3 (for reasons other than page speed) skew the trend. This also becomes obvious when looking at the trend of the medians (dots), which increases slightly with better positioning. If you prefer, we can run a similar analysis excluding the 5% of outliers with slow speeds.


Additional plot including 100% range:


Key takeaways:

  • Most reported page speeds are below 5,000 milliseconds (5 seconds), but a few URLs are remarkably slower, with speeds up to 7M milliseconds (= 1.9 hours! !!CED: Broken URLs / wrongly reported values??)

Effect of Large Domains

(Note: We excluded the 100% range here to make the pattern better visible.)



Key takeaways:

  • On average, target.com and yelp.com have the slowest URL’s among all large domains (3.7 and 3.6 seconds, respectively).

  • mapquest.com and yellowpages.com report the fastest average page speed (1.4 and 1.6 seconds, respectively).

  • For most domains the range of page speed is quite narrow; however, target.com especially, but also facebook.com and instagram.com, report a variety of speeds ranging from close to zero up to more than 5 (and in the case of target.com even 10) seconds.

  • There is no trend as to whether slower or faster page speed of large domains correlates with better position on average.



2.2.3 Time-on-Site

(Note: Logarithmic scale on the x axis.)

Key takeaways:

  • On a logarithmic scale, Alexa’s daily time-on-site measure is normally distributed with a mean of 197.7 seconds.

  • The range covers less than 10 and more than 10,000 seconds (~167 minutes).




(Note: Logarithmic scale on the x axis. A plot with a simple linear axis can be found below after the key takeaways.)

Key takeaways:

  • Time on site is positively correlated with position: the time spent daily increases by ~3 seconds per position (linear model: time-on-site = 214.18 - 2.93 * position).
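Reading numbers off that fit (assuming the intended model is time-on-site as a function of position, the only reading consistent with “~3 seconds per position”):

```python
# reported coefficients: intercept 214.18, slope -2.93 per position
def predicted_time_on_site(position):
    return 214.18 - 2.93 * position

top = predicted_time_on_site(1)      # ~211.3 seconds at #1
bottom = predicted_time_on_site(10)  # ~184.9 seconds at #10
first_page_gap = top - bottom        # ~26.4 seconds across the first page
```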


Additional plot without logarithmic scale:




Effect of Large Domains

(Note: Logarithmic scale on the x axis. A plot with a simple linear scale can be found below after the key takeaways.)


Key takeaways:

  • Large domains have remarkably smaller ranges of time on site compared to other domains.

  • However, the majority of visits on large domains were considerably longer.

Additional plot without logarithmic scale:


Key takeaways:

  • On average, visitors spend the longest time on facebook.com (1,035 seconds), followed by youtube.com (698), twitter.com (643) and linkedin.com (592).

  • Interestingly, the average time spent on facebook.com is almost identical to the maximum.

  • The lowest average time-on-site values are found for yellowpages.com (124 seconds), target.com (138) and tripadvisor.com (161).

  • Notably, a share of visitors spends much more time than average on imdb.com and mapquest.com (averages: 214 and 200 seconds, maximum: 738); in both cases, these URLs are often ranked lower than average.



2.3 Page-level Factors

2.3.1 HTML Tags (Matching of Title and H1 Tag with Keyword)

Title Match

Key takeaways:

  • URL titles contain around 65% to 85% of the keywords.



Key takeaways:

  • The median and range of matching between title and keyword are almost the same among positions, with 50% of URL titles matching between 60% and 95% of the keyword.

  • The linear fitting predicts an increase of around 1% when going from position 10 to position 1.

  • Note: The support for the linear model is very low (R^2 < 0.001).



Effect of Large Domains

Key takeaways:

  • Titles of large domains match slightly worse (median of 70.7%) than those of other domains (median of 76.5%).

  • For large domains, the linear fitting predicts a decrease of ~2% when position improves from #10 to #1 (title match = 62.2 + 0.26 * position), while it predicts an increase of ~1% for other domains.

  • Note: The support for the linear models is very low - R^2 < 0.001 in both cases!


Key takeaways:

  • Titles of quora.com match particularly well with keywords (average of 86.9%), followed by pinterest.com (79.2%), linkedin.com (77.2%) and tripadvisor (75.3%). Those of youtube.com and yelp.com score on average below 50% (47.1% and 28.7%, respectively). The overall average is 67.3%.

  • There seems to be no trend per large domain: higher matching of title and keyword is sometimes associated with lower average position (purple bars), sometimes with higher average position (cyan bars).



H1 Tag Match

Key takeaways:

  • H1 tags of most URL’s contain around 60% to 80% of the keywords.



Key takeaways:

  • There is a slightly negative trend of H1 tag matching with the keyword when URLs are ranked at higher positions.

  • The median and range of matching between H1 tag and keyword are almost the same among positions: the median varies between 70.8% (#1 and #2) and 71.9% (#5), and 50% of URL H1 tags match between ~60% and ~95% of the keyword.



Effect of Large Domains

Key takeaways:

  • While there is no trend of H1 tag matching for other domains, the matching correlates with position for large domains.

  • Linear model for large domains: H1 tag match = 58.7 - 0.84 * position → A change from position 10 to position 1 is accompanied by a change from 50.3% to 57.9% in H1-keyword matching.

  • Linear model for other domains: H1 tag match = 65.6 + 0.187 * position → almost no change of H1 tag matching with position (change of <2% from position 10 to position 1).

  • Note: None of the linear models is supported very well (R^2 of 0.01 for large domains and 0.0008 for other domains).


Key takeaways:

  • Among URLs from large domains, H1 tag matching with keywords exhibits high variability: the averages range between 36.3% and 77.9%.

  • URLs of quora.com score best (77.9%), followed by tripadvisor.com (74.8%), yelp.com (73.3%) and linkedin.com (72.9%). URLs of facebook.com (39.7%), yellowpages.com (38.4%) and imdb.com (36.3%) score worst.

  • Again, there seems to be no trend of higher matching with position at the domain-level: sometimes a higher-than-average matching is associated with a lower average position (purple bars), sometimes with a higher average position (cyan bars).



2.3.2 Page Size (HTML)

Key takeaways:

  • Most URLs have page sizes in the range of 10 to 1,000, with a mean of 156.




Key takeaways:

  • There is no correlation of page size with position.

  • Page size does not differ much among positions.

  • Most URLs are quite small (around 100 units) and a few are very large, up to and beyond 60,000 (light-green bars in the second plot).

  • The median page size is, largely independent of position, around 94 (93.7 overall, ranging from 90.5 at #10 to 96.4 at #3).



Effect of Large Domains


Key takeaways:

  • Large domains have remarkably lower page sizes compared to other domains, but a wider range for ~75% of the URLs.


Key takeaways:

  • tripadvisor.com and youtube.com have the largest pages on average (718 and 644, respectively), yellowpages.com and en.wikipedia.org the smallest (2 and 35, respectively).

  • The largest URL’s can be found for youtube.com (3,759) and walmart.com (3,677).

  • We see no trend of position with smaller- or larger-than-average page size. However, for at least 3 of the 4 domains with the largest average page size, we find a higher average position when pages are larger than the respective domain’s average.



2.3.3 Content Score

Key takeaways:

  • Higher content scores correlate with better positions.

  • On both desktop and mobile, an increase of 1% in content score improves position by 1.


To focus a bit more on the main pattern, we keep only the central 50% of URLs of each position.
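Trimming to the middle 50% of URLs per position can be sketched as a grouped quantile filter (synthetic scores; `content_score` is our stand-in column name):

```python
import numpy as np
import pandas as pd

# toy data: 100 content scores for each of the 10 positions
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "position": np.repeat(np.arange(1, 11), 100),
    "content_score": rng.normal(60, 15, size=1_000),
})

# flag, per position, the scores between that position's 25th and
# 75th percentiles, then keep only those rows
central = df.groupby("position")["content_score"].transform(
    lambda s: s.between(s.quantile(0.25), s.quantile(0.75))
)
trimmed = df[central]
```

The same filter with other quantile pairs yields the 75% or 95% variants mentioned elsewhere in the report.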

Key takeaways:

  • Now the trend of increasing position with content score is even more obvious.



2.3.5 Anchor Text

% Exact Matches

Key takeaways:

  • Not much to see here: More than 95% of all anchor texts did not match the keyword at all, independent of position (all bars light green).

Additional plot
Note: In this case, the segments all lie behind the dot, meaning that more than 95% of the values are (at least, and actually exactly) zero. The fitting also accounts for a few URLs with very high values (not contained in the main 95%), so it looks a bit off. However, we discourage you from concluding anything from this plot and thus removed it from the main report - the change is too small to be relevant. !!! Which plot do you mean here? !!CED: The plot with the dots directly above this text (embedded in the “details” code), where once again all lines (50% and 95%) lie behind the dot.



% Partial Matches

Note: In this case, the segments all lie behind the dot, meaning that more than 95% of the values are (at least, and actually exactly) zero. The fitting also accounts for a few URLs with very high values (not contained in the main 95%), so it looks a bit off. However, we discourage you from concluding anything from this plot since the change is too small to explain anything at all.

Key takeaways:

  • Same pattern as with exact matches: more than 95% of all anchor texts did not even partially match the keyword, independent of position (all bars light green).

!!!Dan: Take the plot out of “Additional plots”. The plot is interesting.

!!CED: I have moved it. I hope that by “interesting” you mean there is no trend? As described, the change is in the per-mille range.



2.3.6 URL Rating

Key takeaways:

  • URL ratings are generally low, with an average of 11.2.



Key takeaways:

  • The median URL rating is almost identical across positions (range: 11-12).

  • URLs on positions 1 to 6 have a median of 12, lower-ranked URLs a median of 11.

!!! What about the fitting curve? Not needed here? !!CED: Oh, that was not intended. My laptop failed to render one of the plots; it was probably this one. This is now fixed, together with the removal of the percent signs.


Effect of Large Domains

Key takeaways:

  • Large domains also score higher than other domains (median of 15 versus 11; mean of 14.8 versus 10.4).

  • The ranges of URL rating are in general very narrow for most URLs, no matter whether they belong to a large or another domain.



Key takeaways:

  • In general, there is not much variation in average URL rating across positions.

  • Among the top large domains with regard to URL rating are the social media platforms (facebook.com, twitter.com, instagram.com, youtube.com, linkedin.com), en.wikipedia.org and amazon.com.

  • instagram.com and en.wikipedia.org have by far the highest URL ratings, with values above 70.



2.3.7 URL Length


Key takeaways:

  • Most URLs are between 40 and 100 characters long, with a mean of 66.


Key takeaways:

  • There is a wide range of URL lengths, some with more than 2,000 characters (maximum of 2,075 for #8).

  • Since the pattern is quite linear, we used a simple linear regression here: length of URL = 60.43 + 1.023 * position (but with a very low R^2 of 0.01).

  • The average length of URLs increases with lower ranking → URLs on #10 are on average 9.2 characters longer (70.7) than those on #1 (61.5).

  • In general, the ranges of URL length are roughly the same among positions, but the maximum length also increases slightly with lower ranking. Especially #1 and #2 have low maxima compared to the other 8 positions.

  • The majority of the data (> 95%) has relatively short URLs, with an overall average of ~66 characters.

  • To see the trends in more detail, we can either use a log scale or use the same plot with only the central 75% per position.

  • The range of URL length is almost the same among positions, with a slight trend towards shorter URLs on top positions.



– zoom-in


Key takeaways:

  • Indeed, when focussing on the majority of URLs, the distribution of URL length is almost the same for all positions, with slightly shorter URLs for most top-ranked pages.




Additional plots Logarithmic scale:
Another way to look a bit more closely at differences in URL length.



Effect of Large Domains

Key takeaways:

  • URLs of large domains are on average slightly shorter than other URLs (60.8 characters versus 67.2 characters).

  • The increase in average length with lower ranking is not as clear as for other domains, amounting to only 4.8 characters when comparing #1 and #10 (versus 10.9 characters for all other domains).



Key takeaways:

  • tripadvisor.com has the longest URLs on average (108.9 characters), followed by reddit.com (85.8), walmart.com (83.8) and quora.com (78.0).

  • The longest individual URLs are hosted by facebook.com (1,327 characters) and ebay.com (1,255 characters).

  • There is no clear pattern of URL length versus position; longer URLs more often correlate with lower positions, but not always.



2.3.8 Word Amount

Key takeaways:

  • Most URLs contain between 100 and 10,000 words.




Key takeaways:

  • Most URLs (> 95%) contain between 100 and 10,000 words in their body with a median of .

  • The linear fitting predicts a tiny decrease in body word count of 2.47 words per position (linear model: word amount = 1461.1 - 2.47 * position).

  • Note: The support for the linear model is extremely low (R^2 < 0.00001)!


Additional plot without log scale



Key takeaways:

  • Large domains have a remarkably lower median and range of words compared to other domains (median word count: 224 versus 932; maximum word count in body of ~62,000 versus ~1.4 million words).





Session Info

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.1.0   forcats_0.4.0      stringr_1.4.0      dplyr_0.8.3       
##  [5] purrr_0.3.3        readr_1.3.1        tidyr_1.0.0        tibble_2.1.3      
##  [9] ggplot2_3.3.0.9000 tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5  xfun_0.12         haven_2.2.0       lattice_0.20-38  
##  [5] colorspace_1.4-1  vctrs_0.2.2       generics_0.0.2    viridisLite_0.3.0
##  [9] htmltools_0.4.0   yaml_2.2.0        rlang_0.4.4       pillar_1.4.3     
## [13] glue_1.3.1        withr_2.1.2       DBI_1.0.0         dbplyr_1.4.2     
## [17] modelr_0.1.5      readxl_1.3.1      lifecycle_0.1.0   munsell_0.5.0    
## [21] gtable_0.3.0      cellranger_1.1.0  rvest_0.3.5       evaluate_0.14    
## [25] knitr_1.27        fansi_0.4.1       highr_0.8         broom_0.5.2      
## [29] Rcpp_1.0.3        scales_1.1.0      backports_1.1.5   webshot_0.5.2    
## [33] jsonlite_1.6      fs_1.3.1          hms_0.5.2         digest_0.6.23    
## [37] stringi_1.4.3     rprojroot_1.3-2   grid_3.6.2        here_0.1         
## [41] cli_2.0.1         tools_3.6.2       magrittr_1.5      crayon_1.3.4     
## [45] pkgconfig_2.0.3   xml2_1.2.2        reprex_0.3.0      lubridate_1.7.4  
## [49] assertthat_0.2.1  rmarkdown_2.1     httr_1.4.1        rstudioapi_0.11  
## [53] R6_2.4.1          nlme_3.1-142      compiler_3.6.2